ABSTRACT

In this paper we present our work on the ad-hoc search and the
tweet timeline generation (TTG) tasks of the TREC-2014 Microblog
track. For the ad-hoc search task, we used the best system we
developed over the last year, which includes hyperlink-based query
expansion and fusion of re-ranking models. For the new tweet
timeline generation task, we applied a simple, straightforward
approach that clusters retrieval results based on Jaccard
similarities between tweets. Our best ad-hoc results ranked fifth
and seventh among 21 participating groups when evaluated using
P@30 and MAP, respectively. Our best TTG run, however, ranked
second among participants, which shows that our simple TTG
approach was more effective than most of the TTG systems
submitted to TREC.

1.  INTRODUCTION 

We describe the participation of the Qatar Computing Research
Institute (QCRI) group in the TREC-2014 Microblog track. This
year the track included two tasks: the ad-hoc search task and the
newly introduced tweet timeline generation (TTG) task. In the
ad-hoc task, we applied what we have learned from our participation
in the track over the past three years, including hyperlink-based
query expansion methods [4, 13] and the selection and fusion of
multiple re-ranking models [4, 5]. We configured our retrieval
system according to the best results achieved when tested on the
2013 topics [4, 5, 13], since this year's task uses the same
collection with a new topic set.


Figure  1  Ad-hoc  search  system 


We  submitted  four  runs  for  the  ad-hoc  task  while  enabling  and 
disabling  hyperlink-based  pseudo  relevance  feedback  (HPRF)  and 
reranking.  The  run  which  applied  both  HPRF  and  reranking  was 
then  used  in  the  TTG  task  by  clustering  the  results  according  to 
similarity. 

For the TTG task, since it is running for the first year, we decided
to keep it simple and straightforward (KISS): we used a simple
implementation of Jaccard similarity to measure the distance
between tweets in the top N retrieved results and clustered those of
high similarity together. Four runs were submitted for the TTG
task, using different values of N and two different formulas for
calculating the similarity between tweets.

Although our best ad-hoc run ranked only seventh among
participants, our best TTG system, which was built on this run,
ranked second. This shows the effectiveness of our simple TTG
approach, which outperformed most of the systems of other groups
that used better lists of retrieved results.

Details  and  results  of  our  runs  are  described  below. 
Figure 2 Demonstration for Condorcet-fuse algorithm


2.  AD-HOC  SEARCH  TASK 

Figure  1  presents  the  full  architecture  of  our  microblog  ad-hoc 
retrieval  system. 

Overall, we designed our pipeline to combine query expansion
and result re-ranking. For query expansion, we made use of the
external documents linked by the URLs in the initial search results.
For result re-ranking, our system relied on learning to rank with
extensive engineering work, and produced the final list by
combining the ranked lists of different rankers.

2.1  Hyperlink-based Pseudo Relevance Feedback (HPRF)

A hyperlink in a tweet is more than a link to related content, as it
would be in a webpage; it usually points to the main focus of the
tweet. In fact, sometimes the tweet's text itself carries no relevant
content, and the main content lies in the embedded hyperlink,
e.g. “This is really amazing, you have to check htxvins.net/scale2”.

Analyzing the TREC microblog datasets of the past three years,
we found that more than 70% of relevant tweets contain hyperlinks.
This motivates using the content of hyperlinked documents in an
efficient way for query expansion.

The content of hyperlinked documents in the initial set of top
retrieved tweets is extracted and integrated into the PRF process.
Titles of hyperlinked pages usually act as headings for the
documents' content, which can enrich the vocabulary in the PRF
process.

We extract expansion terms at two different levels:

•  Tweets level (PRF): the traditional PRF, where terms are
extracted from the initial set of retrieved tweets while
neglecting embedded hyperlinks.

•  Hyperlinked document titles level (HPRF): the page titles of
the hyperlinked documents in feedback tweets are extracted and
added to the tweets for term extraction in the PRF process.

Titles and meta-descriptions of hyperlinked documents may
include unneeded text. For example, titles usually contain
delimiters like ‘-’ or ‘|’ before or after the page domain name,
e.g., “... | CNN.com” and “... - YouTube”. We clean these fields
through the following steps [4, 5]:

•  Split page titles on delimiters and discard the shorter
substring, which is assumed to be the domain name.

•  Detect error page titles, such as “404, page not found!”,
and consider them broken hyperlinks.

•  Remove special characters, URLs, and snippets of
HTML/JavaScript/CSS code.

This  process  helps  in  discarding  terms  that  are  potentially  harmful 
if  used  in  query  expansion. 
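
The cleaning steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the regular expressions, the list of error phrases, and the function name `clean_title` are hypothetical, since the exact rules are not given here (see [4, 5]).

```python
import re

# Hypothetical patterns; the exact rules used in the paper are not listed.
DELIMITERS = re.compile(r"\s+[|\-]\s+")          # " | " or " - " between parts
ERROR_TITLE = re.compile(r"\b(404|not found|access denied)\b", re.IGNORECASE)
URL = re.compile(r"https?://\S+|www\.\S+")
MARKUP = re.compile(r"<[^>]+>")                   # leftover HTML tags
SPECIAL = re.compile(r"[^\w\s]")                  # punctuation and symbols


def clean_title(title):
    """Return a cleaned page title, or None if the link looks broken."""
    if ERROR_TITLE.search(title):
        return None  # treat error pages as broken hyperlinks
    # Split on delimiters and keep the longest part; shorter parts
    # are assumed to be the site or domain name.
    parts = [p.strip() for p in DELIMITERS.split(title) if p.strip()]
    if not parts:
        return None
    text = max(parts, key=len)
    text = URL.sub(" ", text)
    text = MARKUP.sub(" ", text)
    text = SPECIAL.sub(" ", text)
    return " ".join(text.split())
```

For example, `clean_title("Funny cat video - YouTube")` keeps only the part before the delimiter, while a title such as “404, page not found!” is reported as a broken link.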

TFIDF [8] and Okapi [12] weighting were used to rank the
candidate terms, and the top-ranked terms were used for query
expansion. We calculate TFIDF for a term x as follows:

$TFIDF(x) = [tf_t(x) + d_1 \cdot tf_{ht}(x) + d_2 \cdot tf_{hd}(x)] \cdot \log\frac{N}{df(x)}$  (1)

where tf_t(x) is the term frequency of term x in the top n_d initially
retrieved tweet documents used in the PRF process; tf_ht(x) is the
term frequency of term x in the titles of hyperlinks in the top n_d
tweets; and tf_hd(x) is the term frequency of term x in the
meta-descriptions of hyperlinks in the top n_d tweets. d_1 and d_2 are
binary indicators that equal 0 or 1 according to the content level of
hyperlinked documents used in the expansion process. df(x) is the
document frequency of term x in the collection; and N is the total
number of documents in the collection.
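
As a rough illustration of equation 1, the term scoring and the selection of the top terms could be sketched as below. The sketch assumes the IDF factor is log(N / df(x)), which is the standard form and matches the definitions of df(x) and N above; the function and parameter names are hypothetical.

```python
import math
from collections import Counter


def expansion_term_scores(tweets, titles, descriptions, df, n_docs,
                          use_titles=True, use_descriptions=False):
    """Score candidate expansion terms following equation 1.

    tweets, titles, descriptions: lists of token lists taken from the
    top n_d feedback tweets and their hyperlinked pages.
    df: dict mapping a term to its document frequency in the collection.
    n_docs: total number of documents in the collection (N).
    """
    tf_t = Counter(tok for doc in tweets for tok in doc)
    tf_ht = Counter(tok for doc in titles for tok in doc)
    tf_hd = Counter(tok for doc in descriptions for tok in doc)
    d1 = 1 if use_titles else 0        # binary switch for title terms
    d2 = 1 if use_descriptions else 0  # binary switch for description terms

    scores = {}
    for term in set(tf_t) | set(tf_ht) | set(tf_hd):
        tf = tf_t[term] + d1 * tf_ht[term] + d2 * tf_hd[term]
        if tf == 0 or df.get(term, 0) == 0:
            continue  # skip switched-off terms and terms unseen in the collection
        scores[term] = tf * math.log(n_docs / df[term])  # assumed IDF form
    return scores


def top_terms(scores, n_t):
    """Return the n_t terms with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:n_t]
```

With `use_titles=False` and `use_descriptions=False` the function reduces to the tweets-level PRF; switching `use_titles` on corresponds to the HPRF title level.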

The free parameters k1 and b of the Okapi weighting were set to
2 and 0, respectively. The parameter b was set to 0 since the
variation in tweet length is limited by Twitter's constraint on
the number of characters (max. 140 characters).

Terms extracted from the top n_d initially retrieved documents are
ranked according to equation 1, and the top n_t terms with the highest
TFIDF are used to formulate Q_E for the expansion process.
A weighted geometric mean is used to calculate the final retrieval
score for a given query Q according to equation 2:

$P(Q|d) = P(Q_0|d)^{1-\alpha} \cdot P(Q_E|d)^{\alpha}$  (2)

where Q_0 is the original query; Q_E is the set of extracted
expansion terms; P(Q|d) is the probability of query Q being
relevant to document d; and α is the weight given to expansion
terms compared to the original query (when α = 0, no expansion is
applied). A language-model-based retrieval model was used to
calculate the probability of relevance.
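
The weighted geometric mean of equation 2 can be illustrated with a small sketch. The function names are hypothetical, and the two input probabilities are assumed to come from the language-model retrieval step, which is not reproduced here.

```python
import math


def combined_score(p_original, p_expansion, alpha):
    """Weighted geometric mean of the two query likelihoods (equation 2)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return (p_original ** (1.0 - alpha)) * (p_expansion ** alpha)


def combined_log_score(log_p_original, log_p_expansion, alpha):
    """Same combination in log space, which avoids underflow for tiny values."""
    return (1.0 - alpha) * log_p_original + alpha * log_p_expansion
```

Setting `alpha=0` returns the original query score unchanged, which matches the no-expansion case described above.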

2.2  Tweets  Re-ranking 

Similar to our approach in TREC2013 [4], we again explored
ensembling multiple ranking models for re-ranking the retrieved
tweets. Our models were learned using the Tweets2011-13 qrels and
tested with the Tweets2014 queries. We employed six
learning-to-rank algorithms as the candidate rankers for search
result fusion: RankNet [2], RankBoost [6], Coordinate Ascent [10],
MART [7], LambdaMART [14] and RandomForests [1], using the RankLib
package (see footnote 1). Based on these algorithms, we trained
eight different rankers: (1) a RankBoost model trained without a
validation set; (2) a MART model learned using 80% of the training
queries for training and 20% for validation; (3) a RandomForests
model learned in the same way as (2); (4) a RankNet model learned
in the same way as (2); (5) two Coordinate Ascent models learned in
the same way as (2), one optimizing MAP and the other optimizing
P@30; (6) two LambdaMART models learned in the same way as (5).
Unlike last year's configuration, we did not use query selection
methods to construct the validation set, since this strategy did
not noticeably improve the effectiveness of our TREC2013 system
[4]. However, we used exactly the same feature list as last year,
which was shown to be useful (see [4] for details).

Last year, we simply summed the relevance scores of all
learning-to-rank models for tweet re-ranking. This year, we instead
combined the ranking scores of the candidate rankers using weighted
Condorcet-fuse. Condorcet-fuse is one of the state-of-the-art
fusion methods in metasearch due to its effectiveness [11].
The basic idea is that tweets that beat more tweets in a pairwise
manner, based on the scores they received from the candidate
rankers, should be ranked higher. Taking the ranked lists generated
by the candidate rankers as input, we produced a Condorcet graph and
output the final ranked list by computing a Hamiltonian path of
that graph.

The workflow of generating the Condorcet graph is demonstrated in
Figure 2. Given four candidate rankers and three tweets, the
relevance scores assigned to tweets by the rankers form a
ranker-tweets matrix, shown in the first frame; (r_i, t_j) stands
for the relevance score given by candidate ranker r_i to tweet t_j.
We then derive the tweet-tweet relation matrix to reveal the
pairwise preferences. For a pair of tweets (t_j, t_k), we compute
their relation score by counting the number of rankers giving a
higher score to t_j than to t_k. Third, we generate the Condorcet
graph. For a pair of tweets t_j and t_k, there exists an edge from
t_j to t_k if the value of (t_j, t_k) in the tweet-tweet relation
matrix is higher than or equal to 0. For tweets that tie, there is
an edge pointing in each direction. A Hamiltonian traversal of this
graph produces the final ranked list. The details of the algorithm
can be found in [11].

To reflect the different importance of the candidate rankers, we
implemented a weighted version of Condorcet-fuse. In this case, t_j
beats t_k if the sum of the weights of the rankers that rank t_j
higher than t_k is larger than the sum of the weights of those that
prefer t_k to t_j. We used the mean average precision (MAP)
obtained by each candidate ranker on the Tweets2011-2013
dataset as the weight of the corresponding ranking model.
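
A minimal sketch of weighted Condorcet-fuse is given below. It is an illustration, not the implementation used for the submitted runs. It assumes that each ranker scores every tweet (a missing score is treated as the lowest possible), that an edge t_j -> t_k exists when the net weighted preference is non-negative, and that the Hamiltonian path is built by simple insertion. All names are hypothetical.

```python
from itertools import combinations


def weighted_condorcet_fuse(scores, weights):
    """Fuse ranker scores into one ranked list of tweet ids.

    scores: {ranker: {tweet_id: relevance score}}
    weights: {ranker: weight}, e.g. the ranker's MAP on training topics.
    """
    tweets = sorted({t for s in scores.values() for t in s})

    # net[(a, b)]: weight of rankers preferring a to b, minus the reverse.
    net = {}
    for a, b in combinations(tweets, 2):
        diff = 0.0
        for ranker, s in scores.items():
            sa = s.get(a, float("-inf"))
            sb = s.get(b, float("-inf"))
            if sa > sb:
                diff += weights[ranker]
            elif sb > sa:
                diff -= weights[ranker]
        net[(a, b)] = diff
        net[(b, a)] = -diff

    def beats(a, b):
        # Edge a -> b in the Condorcet graph; ties give edges both ways.
        return net[(a, b)] >= 0

    # Build a Hamiltonian path by insertion: place each tweet just before
    # the first tweet on the path that it beats, or at the end otherwise.
    path = []
    for t in tweets:
        for i, u in enumerate(path):
            if beats(t, u):
                path.insert(i, t)
                break
        else:
            path.append(t)
    return path
```

Because every pair of tweets is connected by at least one edge, the insertion step always keeps consecutive tweets on the path connected, so the result is a valid Hamiltonian path of the graph.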

2.3  Submitted  Runs  &  Results 

We submitted four runs to the ad-hoc search task this year, as
follows:

-  PRF1030: Applied standard pseudo-relevance feedback with
number of feedback documents = 10 and number of feedback
terms = 30. The selection of values is based on our study of
different numbers of feedback documents and terms in [5].

-  HPRF1020: Applied hyperlink-based PRF with number of
feedback documents and terms = 10 and 20, respectively.

-  PRF1030RR: The PRF1030 run after applying re-ranking.

-  HPRF1020RR: The HPRF1020 run after applying re-ranking.

Results achieved by our runs are presented in Table 1.

Table 1 QCRI results in TREC 2014 Microblog track for
the ad-hoc search task

Run           MAP      P@30
PRF1030       0.4941   0.6679
HPRF1020      0.5075   0.6685
PRF1030RR     0.4998   0.6988
HPRF1020RR    0.5122   0.6982

The results show that HPRF led to a slight improvement over using
PRF alone on both MAP and P@30. This improvement was found to be
insignificant, which does not align with the results reported on
the TREC-2013 dataset [5]. However, re-ranking led to a noticeable
improvement in P@30, with a slight improvement in MAP. Our best
scores in Table 1 are MAP = 0.5122 (HPRF1020RR) and P@30 = 0.6988
(PRF1030RR).

1  http://sourceforge.net/p/lemur/wiki/RankLib/

3.  TWEETS TIMELINE GENERATION TASK

3.1  Approach 

Our  expectation  was  that  HPRF1020RR  would  achieve  the  best 
result;  this  is  why  we  used  this  run  for  the  TTG  task. 

To generate the timeline of tweets, we applied the following steps:

1.  The top N ranked tweets were normalized by removing name
mentions, hashtags, URLs, emoticons, and stopwords.

2.  The Porter stemmer was applied to the tweets' text.

3.  Similarity was calculated among the top N tweets in the results
list.

4.  A 1NN clustering approach was applied to merge tweets that are
close to each other into the same cluster. The distance between
two tweets was calculated as follows:

$distance(t_i, t_j) = 1 - similarity(norm(t_i), norm(t_j))$

where norm(t_i) is the normalized version of tweet t_i after
applying steps 1 and 2.
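
Steps 1 and 2 and the distance definition can be sketched as follows. This is a simplified illustration: the regular expressions and the tiny stopword list are assumptions, hashtags are assumed to be removed together with their text, and the stemmer is passed in as a callable (an identity function by default) rather than being a real Porter stemmer.

```python
import re

# Assumed patterns and a tiny stopword list, for illustration only.
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
EMOTICON = re.compile(r"(?<!\w)[:;=][-']?[)(DPp/\\|\]\[](?!\w)")
TOKEN = re.compile(r"[a-z0-9]+")
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "to", "of", "in", "and", "for", "on"}
)


def normalize(text, stem=lambda tok: tok):
    """Return the set of normalized terms of a tweet (steps 1 and 2).

    stem: callable applied to each token; pass a Porter stemmer here.
    """
    text = URL.sub(" ", text)        # URLs first, so ":/" is not read as an emoticon
    text = MENTION.sub(" ", text)
    text = HASHTAG.sub(" ", text)
    text = EMOTICON.sub(" ", text)
    tokens = TOKEN.findall(text.lower())
    return {stem(tok) for tok in tokens if tok not in STOPWORDS}


def distance(t_i, t_j, similarity):
    """Distance between two raw tweets given a set-similarity function."""
    return 1.0 - similarity(normalize(t_i), normalize(t_j))
```

Returning a set of terms rather than a list fits the set-overlap similarities used in the clustering step.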

We applied two implementations of the similarity, both of which
are modifications of the Jaccard similarity coefficient, as follows:


$similarity_{EM}(A, B) = \frac{|A \cap B|}{\max(|A|, |B|)}$

$similarity_{SM}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$


similarity_EM calculates the similarity between the texts of two
tweets as the number of common terms divided by the length of
the longer tweet. This leads to merging two tweets into the same
cluster if most of the terms of the long tweet exist in the short
tweet and the difference in length between the two tweets is not
large. similarity_SM leads to more aggressive merging, since it
focuses on how many of the terms of the short tweet exist in the
long tweet, without regard to the difference in length. In the
extreme case, if a tweet contains only one word and that word
exists in the long tweet, similarity_SM equals 1.
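
The two similarity variants can be written directly over term sets. The sketch below assumes tweets are represented as sets of normalized terms and returns 0 for an empty tweet, a boundary case the text does not discuss; the function names are hypothetical.

```python
def similarity_em(a, b):
    """Common terms divided by the size of the longer tweet."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # assumed behaviour for empty tweets
    return len(a & b) / max(len(a), len(b))


def similarity_sm(a, b):
    """Common terms divided by the size of the shorter tweet."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # assumed behaviour for empty tweets
    return len(a & b) / min(len(a), len(b))


long_tweet = {"world", "cup", "final", "tonight"}
short_tweet = {"cup"}
print(similarity_em(long_tweet, short_tweet))  # 0.25
print(similarity_sm(long_tweet, short_tweet))  # 1.0
```

The one-word tweet reproduces the extreme case described above: the second variant reports a perfect match, while the first penalizes the difference in length.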


Table 2 QCRI results in TREC 2014 Microblog track for the
TTG task

Run      P        R_uw     R_w      F1_uw    F1_w
EM50     0.4150   0.2867   0.4779   0.3391   0.4442
EM100    0.3301   0.3797   0.5650   0.3532   0.4167
SM50     0.4798   0.1688   0.3221   0.2497   0.3854
SM100    0.3881   0.2057   0.3416   0.2689   0.3634

3.2  Submitted  Runs  &  Results 

We submitted four runs to the TTG task this year, as
follows:

-  EM50: The top 50 retrieved results from the HPRF1020RR run
were clustered using similarity_EM as the similarity function.
A tweet was merged into a cluster if its similarity to any of
the tweets in that cluster was at least 0.6.

-  EM100: Similar to EM50, but the top 100 retrieved results were
used instead.

-  SM50: Similar to EM50, but similarity_SM was used instead.

-  SM100: Similar to EM100, but similarity_SM was used
instead.

For  all  runs,  the  earliest  tweet  in  each  cluster  is  used  to  represent 
the  cluster  in  the  submitted  run. 
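
The clustering and the choice of cluster representatives can be sketched as below. The exact 1NN procedure is not spelled out in the text, so this sketch assumes a single greedy pass over the tweets in rank order, joins each tweet to the cluster holding its most similar tweet when that similarity reaches the threshold, and orders the output chronologically. The tuple layout, the function names, and the chronological ordering are hypothetical.

```python
def similarity_em(a, b):
    """Common terms divided by the size of the longer tweet."""
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def cluster_tweets(tweets, similarity=similarity_em, threshold=0.6):
    """Greedy nearest-neighbour clustering.

    tweets: list of (tweet_id, timestamp, term_set) in rank order.
    Returns a list of clusters, each a list of tweet tuples.
    """
    clusters = []
    for tweet in tweets:
        best, best_sim = None, 0.0
        for cluster in clusters:
            # Similarity to the closest tweet already in this cluster.
            sim = max(similarity(tweet[2], other[2]) for other in cluster)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best.append(tweet)
        else:
            clusters.append([tweet])
    return clusters


def timeline(clusters):
    """Keep the earliest tweet of each cluster, ordered by time (assumed)."""
    reps = [min(cluster, key=lambda t: t[1]) for cluster in clusters]
    return [t[0] for t in sorted(reps, key=lambda t: t[1])]
```

Passing only the first 50 or 100 ranked tweets and swapping the similarity function gives the four configurations listed above.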

Results of our TTG runs are shown in Table 2. The second
similarity formula, similarity_SM, led to merging most of the
tweets into a small number of clusters. This led to lower recall
but higher precision compared to using similarity_EM. However,
the overall F1 score was much lower than with similarity_EM.
EM100 achieved the better unweighted F1 measure, while EM50
achieved the better weighted F1 measure, which, according to the
scatter plot of all submitted runs, ranked 4th among 48
runs.